doc: add cluster manager reference architecture #1209
minaelee wants to merge 1 commit into canonical:main
Conversation
Force-pushed from 84f08c8 to 2496821
edlerd left a comment
Excellent start. I have many thoughts and comments below. We can have a chat if you'd like to clarify the open issues.
> The MicroCloud Cluster Manager is a centralized tool that provides an overview of MicroCloud deployments. In its initial implementation, it provides an overview of resource usage and availability for all clusters. Future implementations will include centralized cluster management capabilities.
> Cluster Manager stores the data from registered clusters in Postgres and Prometheus databases. This data can be displayed in the Cluster Manager UI, which also links to Grafana dashboards for each MicroCloud.
> which also links to Grafana dashboards for each MicroCloud
This is a possible extension. By default, the COS stack is not available. So a user will deploy cluster manager and get the manager UI without links to Grafana.
Does this update work, or would you prefer we did not mention Grafana at all?
> This data can be displayed in the Cluster Manager UI, which can be extended to link to Grafana dashboards for each MicroCloud.
Note: This information is from https://github.com/canonical/microcloud-cluster-manager/blob/main/ARCHITECTURE.md and likely should be updated there as well.
Yes, suggestion sounds good to me. I'll take a note to update the architecture file.
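For illustration, a minimal Go sketch of the data flow described above: the Management API reads registered cluster data out of Postgres so it can be displayed in the UI. This is a sketch only; the connection string and column names are hypothetical, and only the `remote_cluster_details` table name appears elsewhere in this review.

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/lib/pq" // Postgres driver
)

func main() {
	// Hypothetical connection string; in a real deployment this would come
	// from the Postgres charm relation.
	db, err := sql.Open("postgres", "postgres://cluster-manager@localhost/clusters?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Hypothetical columns on the remote_cluster_details table.
	rows, err := db.Query("SELECT name, status FROM remote_cluster_details")
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()

	for rows.Next() {
		var name, status string
		if err := rows.Scan(&name, &status); err != nil {
			log.Fatal(err)
		}
		fmt.Printf("%s: %s\n", name, status)
	}
}
```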
```{figure} ../images/cluster_manager_architecture.png
:alt: A diagram of Cluster Manager architecture
:align: center
```
This diagram is from an earlier development environment. It is mostly correct, but some things have slightly changed.
Is there an updated diagram, or can you let me know what has changed and I can update it?
We don't have an updated diagram yet. Things that have changed:
- A single TCP load balancer instead of two, exposing two different domain names: one for the management-api and one for the cluster-connector.
- The Postgres service / pg deployment and volume claims are "just" one thing: the Postgres charm. The rest is detail of the PG charm. The diagram exposes too much detail of the PG charm internals, with assumptions that might be wrong.
- Cert manager is to be replaced by a charm implementing the "certificates" charm interface. We might just change the label here.
- k8s secrets/k8s config are to be replaced by a Juju config layer. Under the hood this is still true. I am not sure how to unify the levels of detail in the diagram to surface both k8s internals and charm/Juju internals.
- management-api and cluster-connector live together in the same container. Each container runs those two processes. There can be multiple containers to scale out.
- We might want to add the Canonical Observability Stack as an optional extension to the diagram, with Prometheus and Grafana.
I can create a task for myself to create an updated diagram.
> That static external IP acts as the gateway to route user traffic to the appropriate Kubernetes load balancers.
>
> TCP load balancers
> : Two TCP load balancer services distribute traffic to the Management API and Cluster Connector deployments without terminating TLS. Instead, TLS termination is handled directly within each deployment application. This approach is particularly crucial for the Cluster Connector deployment, as it relies on mutual TLS (mTLS) authentication for secure communication.
We are using a single Traefik instance that is dealing with the incoming requests, no two load balancers anymore.
Fixed to:
> A TCP load balancer (using a Traefik instance) distributes traffic to the Management API and Cluster Connector deployments without terminating TLS.
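For illustration, a minimal Go sketch of why TLS termination stays inside the deployment rather than in the Traefik load balancer: the Cluster Connector needs to see the client certificate itself to perform the mTLS check. This is not the actual Cluster Connector code; the file paths and listen address are hypothetical.

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"log"
	"net/http"
	"os"
)

func main() {
	// Hypothetical paths; in practice the certificates would be provided
	// through the certificates charm relation rather than fixed files.
	caPEM, err := os.ReadFile("/etc/cluster-connector/ca.crt")
	if err != nil {
		log.Fatal(err)
	}
	caPool := x509.NewCertPool()
	caPool.AppendCertsFromPEM(caPEM)

	server := &http.Server{
		Addr: ":8443",
		TLSConfig: &tls.Config{
			// Require and verify a client certificate: this is the mTLS
			// check that a TLS-terminating load balancer would otherwise hide.
			ClientAuth: tls.RequireAndVerifyClientCert,
			ClientCAs:  caPool,
		},
		Handler: http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			// r.TLS.PeerCertificates holds the verified client certificate,
			// which can be matched against the registered cluster's certificate.
			w.WriteHeader(http.StatusOK)
		}),
	}
	log.Fatal(server.ListenAndServeTLS(
		"/etc/cluster-connector/server.crt",
		"/etc/cluster-connector/server.key"))
}
```

Because Traefik only forwards TCP traffic here, the client certificate arrives intact at the deployment.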
> Certificate manager
> : Manages TLS/SSL certificates for secure communication within the Kubernetes cluster. It stores secrets in Kubernetes to be used by various components. The certificates are used by both the Management API and Cluster Connector deployments for HTTPS encryption.
We now rely on a charm that implements the certificates interface to provide certificates. This can be the self-signed-certificates charm, as suggested in the readme. We do not rely on the certificate manager k8s app anymore.
Should the Certificate manager section in lines 33-34 above be removed entirely?
I think we can remove it, yes.
> Persistent Volume (PV) and Persistent Volume Claim (PVC)
> : The Persistent Volume is the storage resource provisioned for the Postgres deployment. The Persistent Volume Claim is the request for storage by the Postgres deployment to ensure data persistence.
We rely on the canonical Postgres charm. How that charm does persistent storage is outside our control.
What information should we provide in this section instead, or should we remove it entirely?
I think we can remove it, yes.
> (ref-cluster-manager-architecture-management-ui)=
> ### UI
>
> The Management API deployment handles serving static assets for the UI. Users access information about clusters through the UI. Through it, users can create remote cluster join tokens and view information about existing tokens, as well as approve or reject join requests.
Suggested change (removing "as well as approve or reject join requests"):
> The Management API deployment handles serving static assets for the UI. Users access information about clusters through the UI. Through it, users can create remote cluster join tokens and view information about existing tokens.
We can expand here: we serve warnings and metric insights on a high level, as well as a list of all registered clusters.
> The Management API handles serving static assets for the UI. Users access information about clusters through the UI. Through it, users can create remote cluster join tokens and view information about existing tokens. The UI also serves warnings and metric insights on a high level.

I added this, but "on a high level" could bear more explanation. Do you mean through optional extension with Grafana, or something more/else?
High level means aggregates of instances and MicroCloud cluster members, like the number of instances and their status distribution (how many are started/stopped/etc.). If the Cluster Manager is extended with the COS/Grafana stack, then Grafana indeed holds detailed information about each instance in every cluster.
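For illustration, a minimal Go sketch of the kind of high-level aggregate meant here: a status distribution across all reported instances. The `Instance` type and status values are invented for the example and are not the real schema.

```go
package main

import "fmt"

// Instance is an illustrative record of one instance reported by a cluster.
type Instance struct {
	Cluster string
	Status  string // e.g. "Running", "Stopped", "Error"
}

// statusDistribution counts how many instances are in each status,
// which is the sort of aggregate the UI could display.
func statusDistribution(instances []Instance) map[string]int {
	counts := make(map[string]int)
	for _, inst := range instances {
		counts[inst.Status]++
	}
	return counts
}

func main() {
	instances := []Instance{
		{Cluster: "cloud-a", Status: "Running"},
		{Cluster: "cloud-a", Status: "Stopped"},
		{Cluster: "cloud-b", Status: "Running"},
	}
	fmt.Println(statusDistribution(instances)) // map[Running:2 Stopped:1]
}
```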
> - mTLS authentication check against the matched certificate
> - Store and overwrite data in the `remote_cluster_details` table
>
> To avoid overwhelming the Cluster Connector deployment, the status endpoints are rate limited. The response sent to the originating cluster includes a delay period (in seconds) that must pass before the next status signal request.
All endpoints are rate limited, not just this one.
Updated to:

> (ref-cluster-manager-architecture-rate-limited)=
> ## Rate limited endpoints
>
> To avoid overwhelming the Cluster Manager, all its endpoints are rate limited. When any endpoint receives a request from a cluster, the response from Cluster Manager includes a delay period (in seconds) that must pass before the next request to that endpoint.

Or did you mean all endpoints for the Cluster Connector deployment only?

Also: do you want to change the term "Cluster Connector deployment" to "Cluster Connector" (like with "Management API deployment" to "Management API"), or does it make sense to keep the word "deployment" here?
I think the suggestion is slightly confusing. We have rate limiting in place to avoid overwhelming the Cluster Manager, yes.
The functionality to signal to the MicroCloud in a response when it should call in again is unrelated to the rate limiting, though.
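To keep the two mechanisms distinct, here is a minimal Go sketch: rate limiting protects the Cluster Manager from excessive requests, while the delay field in the response tells the cluster when to call in again. The endpoint path, payload shape, and limits are hypothetical, not the actual Cluster Manager API.

```go
package main

import (
	"encoding/json"
	"log"
	"net/http"
	"time"

	"golang.org/x/time/rate"
)

// Rate limiter protecting the endpoint (illustrative limits).
var limiter = rate.NewLimiter(rate.Every(time.Second), 10)

type statusResponse struct {
	// Delay (in seconds) before the cluster should send its next status signal.
	// This is a call-in hint, independent of the rate limiter above.
	DelaySeconds int `json:"delay_seconds"`
}

func statusHandler(w http.ResponseWriter, r *http.Request) {
	// Rate limiting: reject the request if the limiter is exhausted.
	if !limiter.Allow() {
		http.Error(w, "too many requests", http.StatusTooManyRequests)
		return
	}
	// Call-in interval: tell the cluster when to report again.
	if err := json.NewEncoder(w).Encode(statusResponse{DelaySeconds: 60}); err != nil {
		log.Println(err)
	}
}

func main() {
	http.HandleFunc("/1.0/status", statusHandler) // hypothetical path
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```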
Signed-off-by: Minae Lee <minae.lee@canonical.com>
Force-pushed from 2496821 to a2c04aa
Add reference architecture documentation for MicroCloud Cluster Manager.